Abstract

A systematic study of the probability distribution of superimposed random codes is presented through the use of generating functions. Special attention is paid to the cases of either uniformly distributed but not necessarily independent or non uniform but independent bit structures. Recommendations for optimal coding strategies are derived.

I. Introduction

Chemical structure retrieval systems are frequently presented with the task of producing a list of all stored chemical graphs containing a prescribed subgraph [1], [2]. Due to the absence of a linear order among the stored data, tree based search strategies fail, and a sequential search has to be performed. To accelerate this time consuming process, the actual graph theoretical substructure match is preceded by prescreening: the entire database is matched against a library of simple but common descriptors, and the validity of descriptors is recorded in a bitstring for each stored structure. Suitable choices for descriptors are small chemical subgraphs containing only few vertices, graph diameters, ring sizes or any other property that passes from subgraphs to supergraphs. When a query structure is submitted to the system, the descriptors are evaluated for this query structure, resulting in a query bit string. Only those stored structures in which each bit is turned on at all positions where the query bits are turned on are candidates for a match, and only these will be subjected to the expensive graph theoretical matching algorithm.

For example, let us consider the compounds in Table I. A chemist might ask for a list of all structures in our database containing 2-(cyclohexylmethyl)naphthalene, which is too complex to be one of the index descriptors. However, any matching structure must necessarily contain cyclohexane and naphthalene, and these might be indexed. We will produce an intermediate result set that also contains 2-(2-cyclohexylethyl)naphthalene, which is not in accordance with the original query specification and must be singled out by graph matching.
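The prescreening test described above amounts to a bitmask comparison. The following is a minimal sketch only; the three-descriptor layout, the bit positions and the third compound are assumptions made for illustration and are not taken from Table I.

```python
def passes_prescreen(stored_bits: int, query_bits: int) -> bool:
    """Return True if every 1-bit of the query is also set in the stored bitstring."""
    return (stored_bits & query_bits) == query_bits

# Hypothetical descriptor layout (assumption): one bit per indexed descriptor.
CYCLOHEXANE, NAPHTHALENE, PYRIDINE = 0b001, 0b010, 0b100

database = {
    "2-(cyclohexylmethyl)naphthalene": CYCLOHEXANE | NAPHTHALENE,
    "2-(2-cyclohexylethyl)naphthalene": CYCLOHEXANE | NAPHTHALENE,
    "pyridine": PYRIDINE,
}

# The query is too complex to be a descriptor itself, so only its indexed parts are tested.
query_bits = CYCLOHEXANE | NAPHTHALENE
candidates = [name for name, bits in database.items()
              if passes_prescreen(bits, query_bits)]
```

Both naphthalene derivatives survive the prescreen, so the second one has to be rejected by the subsequent graph match, exactly as in the example.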

The Beilstein database of organic compounds contains around 10 million structures, each a chemical graph of up to 255 vertices, and the number of descriptors will be of the order of one thousand. With such characteristics, the preevaluated bitstrings will consume a considerable amount of storage and hence of processing time. On the other hand, one may expect the 1-bits to be relatively scarce, whence it should be possible to compress the bitstrings without losing too much information. Thus, we are looking for a map $\psi : I^N \to I^n$, $I = \{0, 1\}$, transforming bitstrings of length $N$ into strings of length $n$. However, we have to observe the partial order relation $\beta \le \beta'$ between bitstrings, defined such that the bits satisfy $\beta(i) \le \beta'(i)$ at all positions $i$. Since we want to use the compressed strings in the same manner (by bitmasking) as the original ones, the transformation must be monotone: $\beta \le \beta' \Rightarrow \psi(\beta) \le \psi(\beta')$, and in particular $\psi(\beta) \vee \psi(\beta') \le \psi(\beta \vee \beta')$, where $\vee$ denotes the bitwise inclusive or operation. If the latter inequality were strict, information would be lost, hence we require

$$\psi(\beta) \vee \psi(\beta') = \psi(\beta \vee \beta') \tag{1}$$

for any two source strings $\beta, \beta' \in I^N$. This is the defining condition of superimposed coding. If we denote by $\beta_j \in I^N$ the elementary string having bit 1 in position $j$ and 0 elsewhere, and set $\psi_j := \psi(\beta_j) \in I^n$ equal to the code word assigned to bit $j$, then we must have

$$\psi(\beta) = \bigvee_{j \in \beta^{-1}(1)} \psi_j, \tag{2}$$

explaining the term. Here $\beta^{-1}(1)$ is the inverse image of 1, i.e. the set of all positions $i$ where the $i$-th bit $\beta(i)$ is turned on.

In summary we emphasize: in our context superposition is not a matter of choice but a requirement.
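The superposition rule can be sketched in a few lines: the target string is the bitwise or of the code words belonging to the source bits that are turned on. The function names and parameter values below are illustrative assumptions, and the code words are drawn at random with fixed weight merely to have something to superimpose; any choice of code words satisfies the defining condition.

```python
import random

def make_code_words(N, n, w, seed=0):
    """Draw N code words of length n, each with exactly w one-bits (illustrative choice)."""
    rng = random.Random(seed)
    return [sum(1 << pos for pos in rng.sample(range(n), w)) for _ in range(N)]

def superimpose(source_bits, code_words):
    """OR together the code words of all source positions that are turned on."""
    target = 0
    for j, word in enumerate(code_words):
        if (source_bits >> j) & 1:
            target |= word
    return target

code_words = make_code_words(N=16, n=32, w=3)
a, b = 0b0000000000010110, 0b0100000000000011

# Defining condition of superimposed coding: coding commutes with bitwise or.
lhs = superimpose(a, code_words) | superimpose(b, code_words)
rhs = superimpose(a | b, code_words)
```

Because the map is built from bitwise or alone, monotonicity follows automatically: a source string covered by another one yields a covered target string.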

We speak of superimposed random coding if the code words $\psi_j$ are constructed using a random number generator. Non random approaches have been considered e.g. in [3], [4] and perform very well when the number of code words superposed in equation (2) is bounded, but are unpredictable beyond this barrier. The random approach was probably initiated by [5], but there an invalid probability analysis was given. A thorough study of the case where source bitstrings and code words are of fixed weight was given in [6]. Bloom filters (cf. [7] and [8, p. 572]) constitute an application of this technique where the primary keys of a database are encoded in a bitstring; it should be noted that the approach favored by us produces a two-dimensional bit array for the entire database, one bitstring for each record. Despite the continuing popularity of the subject in the context of chemical substructure search, a broader systematic approach seems to be lacking, a gap that the current paper is intended to fill. One question that has to be addressed is the optimal choice of codes. It has been known at least since Roberts' paper [6] that the target bitstrings should contain a more or less even balance of 0- and 1-bits, but if we want to achieve this other than by trial and error we must study the distribution of target bits. Of particular practical importance is the situation where the source bits are turned on with non uniform probability. A few statements about this case are made by Roberts in [6], but they are based on intuition, not computation, and in fact we disagree with his conclusion. In [9] the author gives recommendations about the selection of descriptors, whereas we set ourselves the task of adapting the coding design to a given set of descriptors.

We must note that the notion "random coding" has an inherent semantic difficulty. It must be understood clearly that coding and decoding of a bit pattern have to be performed within one and the same run of the code generation random experiment. After each coding-decoding cycle the code generation should be repeated independently. However, this procedure does not exactly fit when recording bit patterns in databases: here the codes must be fixed once and for all. Since a large set of $N$ codes is needed, we might expect good statistical behavior in this context too.

The observant reader will have noticed that equation (2) allows for a target string $\psi(\beta)$ to be covered by a comparison pattern, $\psi(\beta) \le \psi(\gamma)$, even if the source patterns do not cover: $\beta \not\le \gamma$. This will result in the inclusion of fake hits in our result sets. As long as their number is small this is acceptable, because the bit comparison will be followed by graph matching anyway. However, we want to predict and minimize this number.

In section II we will develop the probabilistic tools necessary to manage our random bit strings. In section III we demonstrate how to compute the distribution of the target bits from those of source and code bits, and binomially distributed and fixed weight code words are introduced as primary examples for code generation. This settles the case of uniform but not necessarily independent source bit distributions. The requirement of uniformity will be dropped in section IV, but the assumption of independence will be added. The completely general case will be addressed in section V.

II. Isotropic bit distributions 
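Let us consider a fixed source bit pattern $\beta \in I^N$ and a fixed target bit pattern $\alpha \in I^n$; then

$$P(\psi(\beta) \le \alpha) = P(\forall j \in \beta^{-1}(1) : \psi_j \le \alpha) \tag{3}$$

$$= \prod_{j \in \beta^{-1}(1)} P(\psi_j \le \alpha), \tag{4}$$

where the probability is that of the code generation. Now varying the source pattern $\beta$ independently we obtain an expression for the probability of target patterns:

$$P(\psi(\cdot) \le \alpha) = \sum_{\beta \in I^N} P(\beta) \prod_{j \in \beta^{-1}(1)} P(\psi_j \le \alpha). \tag{5}$$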

Definition 1: We call a probability distribution on $I^n$ isotropic if it depends only on the number of 1-bits but not on their position.

This means that the distribution is invariant under all coordinate permutations (in [10] the term homogeneous is used). As is evident from equation (5), the distribution of the target patterns is isotropic if the code generation is, even if the source patterns are non isotropically distributed. Since this observation leads to a considerable simplification of our analysis, we now give a short exposition of isotropic distributions.

We are given numbers $p_0, \ldots, p_n \ge 0$ with $\sum_{k=0}^{n} \binom{n}{k} p_k = 1$ such that the probability of a particular bit pattern $\alpha \in I^n$ is given by $P(\alpha) = p_a$ with $a := \#\alpha^{-1}(1)$. Our main analytical tool will be the probability generating function

$$f(t) := \sum_{k=0}^{n} \binom{n}{k} p_k t^k, \tag{6}$$

this being the standard definition obtained by considering the number of 1-bits as random variable. For fixed $\alpha \in I^n$ with $a := \#\alpha^{-1}(1)$ we define

$$F_a := P(\xi \le \alpha) = \sum_{k=0}^{a} \binom{a}{k} p_k, \tag{7}$$

$$G_a := P(\xi \ge \alpha) = \sum_{k=a}^{n} \binom{n-a}{k-a} p_k. \tag{8}$$

Of course the quantities $F_a$ and $G_{n-a}$ are dual to each other; in fact the one is transformed into the other by switching 0- and 1-bits. They play very different roles for us, though: notice the occurrence of $F_a$ on the right hand side of equation (5). $G_a$ is the relative number of candidates that will be selected by prescreening with bit pattern $\alpha$, because the condition $\xi \ge \alpha$ singles out precisely those patterns $\xi$ that have 1-bits in the same positions as $\alpha$, and possibly some more. We define generating functions as follows:

$$F(t) := \sum_{m=0}^{n} \binom{n}{m} F_m t^m, \tag{9}$$

$$G(t) := \sum_{m=0}^{n} \binom{n}{m} G_m t^{n-m}. \tag{10}$$



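The following relations are readily established:

$$F(t) = (1+t)^n f\!\left(\frac{t}{1+t}\right) \tag{11}$$

$$G(t) = (-1)^n F(-1-t) \tag{12}$$

$$G(t) = t^n f\!\left(\frac{1+t}{t}\right) \tag{13}$$

$$F_{n-m} = \sum_{k=0}^{m} (-1)^k \binom{m}{k} G_k \tag{14}$$

$$G_m = \sum_{k=0}^{m} (-1)^k \binom{m}{k} F_{n-k} \tag{15}$$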

The three sets of coefficients $p_m$, $F_m$ and $G_m$ carry the same information and can be readily converted into each other using the above relations. The reader will certainly notice the difference between (11) and the standard situation [11, Thm. XI.1], which arises from the fact that the values $\alpha$ appearing in definition (7) are partially but not linearly ordered.
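The conversions between the three coefficient sets are easy to check numerically. The sketch below assumes the reading $F_a = \sum_{k \le a} \binom{a}{k} p_k$, $G_a = \sum_{k \ge a} \binom{n-a}{k-a} p_k$ and $G_m = \sum_{k \le m} (-1)^k \binom{m}{k} F_{n-k}$; the function names are ours, and a binomial distribution is used as test input because its coefficients are known in closed form.

```python
from math import comb

def F_from_p(p):
    """F_a: probability that a random pattern is covered by a fixed pattern of weight a."""
    n = len(p) - 1
    return [sum(comb(a, k) * p[k] for k in range(a + 1)) for a in range(n + 1)]

def G_from_p(p):
    """G_a: probability that a random pattern covers a fixed pattern of weight a."""
    n = len(p) - 1
    return [sum(comb(n - a, k - a) * p[k] for k in range(a, n + 1))
            for a in range(n + 1)]

def G_from_F(F):
    """Inclusion-exclusion conversion from the F coefficients to the G coefficients."""
    n = len(F) - 1
    return [sum((-1) ** k * comb(m, k) * F[n - k] for k in range(m + 1))
            for m in range(n + 1)]

# Binomial test case: each bit is 1 with probability 1 - q, independently.
n, q = 6, 0.7
p = [(1 - q) ** m * q ** (n - m) for m in range(n + 1)]
F, G = F_from_p(p), G_from_p(p)
```

For the binomial input, `F[m]` should reproduce $q^{n-m}$ and `G[m]` should reproduce $(1-q)^m$, and `G_from_F(F)` should agree with `G` up to rounding.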

For later reference we need the derivatives of $f$ at 1. By (13) we have $f(1+\varepsilon) = \varepsilon^n G(1/\varepsilon) = \sum_{k=0}^{n} \binom{n}{k} G_k \varepsilon^k$. This allows us to read off the Taylor coefficients of the function $f$ at the point 1 at once:

$$f^{(k)}(1) = \frac{n!}{(n-k)!}\, G_k. \tag{16}$$

We define the moments $\mu_m := \sum_{k=0}^{n} k^m \binom{n}{k} p_k$ of the distribution as usual and recall their generating function [11, Exc. XI.7.24]:

$$f(e^t) = \sum_{m=0}^{\infty} \frac{\mu_m}{m!}\, t^m. \tag{17}$$

Equations (16) and (17) allow us to express the first two moments in terms of the distribution coefficients $F_m$:

$$\mu_1 = n(1 - F_{n-1}) = n G_1 \tag{18}$$

$$\mu_2 = n\left[n - (2n-1)F_{n-1} + (n-1)F_{n-2}\right] = n\left[(n-1)G_2 + G_1\right]. \tag{19}$$



The variance evaluates to

$$\mu_2 - \mu_1^2 = n\left[F_{n-1} - nF_{n-1}^2 + (n-1)F_{n-2}\right], \tag{20}$$

whence in particular:

$$F_{n-2} \ge (n-1)^{-1}\left[nF_{n-1} - 1\right]F_{n-1}. \tag{21}$$
Two examples are of primary importance in our context: 

Example 1: The binomial distribution $p_m = (1-q)^m q^{n-m}$ with parameter $1-q$. Here we have:

$$f(t) = (q + (1-q)t)^n \tag{22}$$

$$F(t) = (q + t)^n \tag{23}$$

$$G(t) = (1 - q + t)^n \tag{24}$$

$$F_m = q^{n-m} \tag{25}$$

$$G_m = (1-q)^m \tag{26}$$

$$\mu_1 = n(1-q) \tag{27}$$

$$\mu_2 = n(1-q)\left[n(1-q) + q\right] \tag{28}$$

$$\mu_2 - \mu_1^2 = nq(1-q). \tag{29}$$

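Example 2: Fixed weight. Here only bit patterns with a fixed number $w$ of 1-bits are permitted. We get:

$$p_m = \begin{cases} \binom{n}{w}^{-1} & m = w \\ 0 & m \ne w \end{cases} \tag{30}$$

$$f(t) = t^w \tag{31}$$

$$F(t) = (1+t)^{n-w}\, t^w \tag{32}$$

$$G(t) = (1+t)^{w}\, t^{n-w} \tag{33}$$

$$F_m = \begin{cases} \binom{n-w}{m-w} \big/ \binom{n}{m} & m \ge w \\ 0 & m < w \end{cases} \tag{34}$$

$$G_m = \begin{cases} \binom{w}{m} \big/ \binom{n}{m} & m \le w \\ 0 & m > w \end{cases} \tag{35}$$

$$\mu_m = w^m \tag{36}$$

$$\mu_2 - \mu_1^2 = 0. \tag{37}$$

Hardly surprising, inequality (21) is sharp for this distribution.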
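The two examples can be compared numerically through the moment formulas $\mu_1 = n(1 - F_{n-1})$ and $\mu_2 = n[n - (2n-1)F_{n-1} + (n-1)F_{n-2}]$. The sketch below assumes the closed forms $F_m = \binom{n-w}{m-w}/\binom{n}{m}$ for fixed weight and $F_m = q^{n-m}$ for the binomial case; the helper names and the parameter values are illustrative.

```python
from math import comb

def fixed_weight_F(n, w):
    """Coefficients F_m for patterns with exactly w one-bits."""
    return [comb(n - w, m - w) / comb(n, m) if m >= w else 0.0
            for m in range(n + 1)]

def binomial_F(n, q):
    """Coefficients F_m when each bit is 0 with probability q, independently."""
    return [q ** (n - m) for m in range(n + 1)]

def mean_and_variance(F):
    """First moment and variance of the number of 1-bits, from F_{n-1} and F_{n-2}."""
    n = len(F) - 1
    mu1 = n * (1 - F[n - 1])
    mu2 = n * (n - (2 * n - 1) * F[n - 1] + (n - 1) * F[n - 2])
    return mu1, mu2 - mu1 ** 2

n, w = 20, 5
fw_mean, fw_var = mean_and_variance(fixed_weight_F(n, w))
# Matching binomial case: same expected weight, q = 1 - w/n.
bin_mean, bin_var = mean_and_variance(binomial_F(n, 1 - w / n))
```

Both distributions have the same mean $w$, but only the fixed weight case has vanishing variance, which is the situation in which the variance inequality holds with equality.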

III. Uniform code word generation 
Denoting the target distribution's coefficients by $\hat F_a$, equation (5) can be written as:

$$\hat F_a = \sum_{\beta \in I^N} P(\beta) \prod_{j \in \beta^{-1}(1)} F_a^{(j)}, \tag{38}$$

where the coefficients $F_a^{(j)}$ describe the random experiment used for generating the $j$-th code word. If we take these independent of $j$, (38) simplifies considerably. Introducing the source bits probability generating function

$$\Pi(t) := \sum_{\beta \in I^N} P(\beta)\, t^{\#\beta^{-1}(1)}, \tag{39}$$

we can formulate a simple but significant theorem:

Theorem 1: If the code word generation is performed uniformly with coefficients $F_a$ and $\Pi$ is the generating function of the source pattern space, then the target bit distribution is given by $\hat F_m = \Pi(F_m)$.

This theorem is the pivot enabling us to compute the target distribution when source and code distribution are known. The source bit distribution acts as a transformation turning the code bit distribution into the target bit distribution. The target distribution depends linearly on the source distribution but in a complex way on the code distribution.
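For very small parameters the statement of the theorem can be checked by brute force. The sketch below is only an illustration under stated assumptions: every source pattern has exactly $r$ one-bits (so the source generating function is $t^r$), code words are drawn independently with each bit equal to 0 with probability $q$ (so that $F_m = q^{n-m}$), and all names are ours.

```python
from itertools import product

def word_probability(word, n, q):
    """Probability of one code word when each bit is 0 with probability q."""
    ones = bin(word).count("1")
    return (1 - q) ** ones * q ** (n - ones)

def target_F_by_enumeration(alpha, n, q, r):
    """P(superposition of r independent code words is covered by alpha), by enumeration."""
    total = 0.0
    for words in product(range(2 ** n), repeat=r):
        target, prob = 0, 1.0
        for word in words:
            target |= word
            prob *= word_probability(word, n, q)
        if (target & ~alpha) == 0:  # no target bit outside alpha
            total += prob
    return total

n, q, r = 3, 0.6, 2
alpha = 0b011
m = bin(alpha).count("1")
enumerated = target_F_by_enumeration(alpha, n, q, r)
predicted = (q ** (n - m)) ** r  # source generating function t**r applied to F_m
```

The enumeration grows like $2^{nr}$ and is therefore useful only as a sanity check; the point of the theorem is that the closed form replaces it.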

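We can immediately derive the first two moments $\hat\mu_1$, $\hat\mu_2$ of the target bit distribution from equations (18) and (19):

$$\hat\mu_1 = n\left[1 - \Pi(F_{n-1})\right] \tag{40}$$

$$\hat\mu_2 = n\left[n - (2n-1)\,\Pi(F_{n-1}) + (n-1)\,\Pi(F_{n-2})\right] \tag{41}$$

$$\hat\mu_2 - \hat\mu_1^2 = n\left[\Pi(F_{n-1}) - n\,\Pi(F_{n-1})^2 + (n-1)\,\Pi(F_{n-2})\right]. \tag{42}$$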



Example 3: Let us consider fixed weight source words of weight $r$ and binomially generated code words with parameter $q$. From (31) we can read off the transformation function $\Pi(t) = t^r$, and by (25) the coefficients are given by $F_m = q^{n-m}$. Now our theorem implies $\hat F_m = q^{r(n-m)}$, i.e., the target bits are binomially distributed with parameter $q^r$. Then (26) immediately implies $\hat G_m = (1-q^r)^m$, and by (29) the variance equals $\hat\mu_2 - \hat\mu_1^2 = nq^r(1-q^r)$. This case is simple because the individual target bits are independently distributed, which is not true in general.

By definition $\hat G_m$ equals the expected relative number of candidates matching a test pattern of weight $m$. They are determined by statistics without reference to the original content, hence they are considered "false drops"¹ and their number shall be minimized. Pursuing our example further, we assume that the test pattern is obtained by submitting a query source string of weight $s$ to the same superimposed coding process. Any pattern of weight $m$ in the target query space will occur with probability $p_m = (1-q^s)^m q^{s(n-m)}$, and we expect a proportion of

$$\vartheta = \sum_{m=0}^{n} \binom{n}{m} p_m \hat G_m = \left[q^s + (1-q^s)(1-q^r)\right]^n = \left[1 - q^r(1-q^s)\right]^n$$

random hits. $\vartheta$ is minimal if we choose $q^s = \frac{r}{r+s}$, with $q^r = \left(1 - \frac{s}{r+s}\right)^{r/s} \approx e^{-r/(r+s)} \approx e^{-1}$ and $\vartheta = \left[1 - \frac{s}{r+s}\,q^r\right]^n \approx \left[1 - \frac{s}{e\,r}\right]^n$ for $s \ll r$, hence $\ln\vartheta \approx -\frac{ns}{er}$. If we do not want the relative number of false drops to exceed a proportion of $\vartheta_{\max}$ but have to allow for query words of weight at least $s_{\min}$, then we must choose our target bit patterns of length $n \ge \frac{e\,r}{s_{\min}}\,|\ln\vartheta_{\max}|$. Notice that the original bitstring length $N$ does not enter at all, just the number of 1-bits $r$ is relevant.
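The sizing argument can be turned into a small calculation. The sketch below assumes the closed form $\vartheta = [1 - q^r(1-q^s)]^n$ for the expected proportion of random hits with binomially generated code words, and works with the exact minimizer $q^s = r/(r+s)$ rather than with the asymptotic estimate; function names and the sample parameters are ours.

```python
from math import ceil, log

def false_drop_fraction(n, q, r, s):
    """Expected proportion of random hits for stored weight r and query weight s."""
    return (1 - q ** r * (1 - q ** s)) ** n

def optimal_q(r, s):
    """Code parameter minimizing the false drop fraction: q**s = r / (r + s)."""
    return (r / (r + s)) ** (1 / s)

def min_target_length(r, s_min, theta_max):
    """Smallest n keeping the false drop fraction at or below theta_max for weight s_min."""
    q = optimal_q(r, s_min)
    log_per_bit = log(1 - q ** r * (1 - q ** s_min))  # negative
    return ceil(log(theta_max) / log_per_bit)

r, s_min, theta_max = 20, 5, 0.01
q_opt = optimal_q(r, s_min)
n_min = min_target_length(r, s_min, theta_max)
```

Queries heavier than `s_min` produce fewer random hits for the same code parameter, so sizing the target length for the lightest admissible query is the conservative choice.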

Example 4: Our example above was chosen for its simplicity; we do not recommend using binomially distributed code words in practice. Fixed weight code words of weight $w$ can be expected to exhibit a much more robust behavior. Because of equations (40) and (42) and the monotonicity of probability generating functions, fixed weight code words will produce minimal variance in the target distribution if the expectation is prescribed. Both cases may be directly compared if the parameter of the binomial distribution is chosen as $q = 1 - \frac{w}{n}$, because by (34) $F_{n-1} = \binom{n-w}{n-1-w} \big/ \binom{n}{n-1} = \frac{n-w}{n} = q$. Furthermore

$$F_{n-2} = \binom{n-w}{n-2-w} \Big/ \binom{n}{n-2} = \frac{(n-w)(n-w-1)}{n(n-1)} = q\left(q - \frac{1-q}{n-1}\right).$$

The variance

$$\hat\mu_2 - \hat\mu_1^2 = n\left[q^r - nq^{2r} + (n-1)\,q^r\left(q - \frac{1-q}{n-1}\right)^r\right] \tag{43}$$

$$= nq^r\left[1 - nq^r + (n-1)\left(q - \frac{1-q}{n-1}\right)^r\right] \tag{44}$$

can be computed asymptotically for large $n$ by developing the rightmost bracket in a power series $\left(q - \frac{1-q}{n-1}\right)^r = q^r\left[1 - \frac{r(1-q)}{q(n-1)} \pm \cdots\right]$, where all higher powers may be safely discarded:

$$\hat\mu_2 - \hat\mu_1^2 \approx nq^r(1-q^r)\left[1 - r(1-q)\,\frac{q^{r-1}}{1-q^r}\right], \tag{45}$$

the first factor being equal to the value in the binomial case. With suitable parameters the factor in square brackets may be quite small, i.e., the fixed weight variance may be negligible while the binomial variance is not.

We reconsider example 3 with fixed weight code words. In [6] it is shown² that $\vartheta \approx (1-q^r)^{n(1-q^s)}$. The minimum value is attained for $q^r \approx \frac12$ with $\ln\vartheta \approx -\frac{ns}{r}(\ln 2)^2$. This value is slightly better than in example 3. We have $\operatorname{var} \approx \frac{n}{4}(1 - \ln 2)$, which is by a factor $1 - \ln 2$ smaller than the binomial case.
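The difference between the two code generation schemes is easy to quantify numerically. The sketch below assumes source words of fixed weight $r$, uses $F_{n-1} = q$ and $F_{n-2} = q(nq-1)/(n-1)$ with $q = 1 - w/n$ for fixed weight code words, and compares the resulting target variance with the binomial value $nq^r(1-q^r)$ and with the first-order large-$n$ estimate. Function names and the sample values are illustrative.

```python
def binomial_variance(n, q, r):
    """Target weight variance for binomially generated code words."""
    return n * q ** r * (1 - q ** r)

def fixed_weight_variance(n, w, r):
    """Target weight variance for fixed weight code words and source weight r."""
    q = 1 - w / n
    F1 = q                            # F_{n-1}
    F2 = q * (n * q - 1) / (n - 1)    # F_{n-2}
    return n * (F1 ** r - n * F1 ** (2 * r) + (n - 1) * F2 ** r)

def fixed_weight_variance_estimate(n, w, r):
    """First-order large-n estimate of the fixed weight variance."""
    q = 1 - w / n
    return n * q ** r * (1 - q ** r) * (1 - r * (1 - q) * q ** (r - 1) / (1 - q ** r))

n, w, r = 1000, 35, 20
exact = fixed_weight_variance(n, w, r)
estimate = fixed_weight_variance_estimate(n, w, r)
binomial = binomial_variance(n, 1 - w / n, r)
```

With these sample values roughly half of the target bits are set, and the fixed weight variance comes out at well under half of the binomial one.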

IV. Adapting to non uniform but independent source bit distributions 

If the individual source bits are independent and the $i$-th bit is turned on with probability $p_i$, the probability of a given source pattern $\beta \in I^N$ is

$$P(\beta) = \prod_{i \in \beta^{-1}(1)} p_i \prod_{i \notin \beta^{-1}(1)} (1-p_i), \tag{46}$$

and the target coefficients become

$$\hat F_a = \prod_{j=1}^{N} \left(p_j F_a^{(j)} + 1 - p_j\right). \tag{47}$$

(47) is obtained by inserting (46) into (38) and distributively collecting terms. In our selection of optimal code word distributions we let ourselves be guided by what we learned in section III. We set $\hat F_{n-1} = \frac12$ and minimize $\hat F_{n-2}$, observing that by equation (19) this is equivalent to minimization of the second moment and hence of the variance. By inequality (21), $F^{(j)}_{n-2} \ge (n-1)^{-1}\left(nF^{(j)}_{n-1} - 1\right)F^{(j)}_{n-1}$, the lower bound being attained for fixed weight code words. Substituting

$$u_j := p_j F^{(j)}_{n-1} + 1 - p_j \tag{48}$$



¹ This idiomatic expression is historic and derives from the application to punched cards.

² As a matter of fact the equation is derived by using the binomial case as approximation and observing that target query weights have small variance.



we have to solve the minimization problem

$$\hat F_{n-2} = \prod_{j=1}^{N} \frac{n}{(n-1)\,p_j}\left[\left(u_j - \nu_j(1-p_j)\right)^2 + \omega_j\, p_j(1-p_j)\right] \;\to\; \min \tag{49}$$

under the constraint

$$\prod_{j=1}^{N} u_j = \frac12 \tag{50}$$

in the domain

$$1 - p_j < u_j < 1, \tag{51}$$

where

$$\nu_j := 1 + \frac{p_j}{2n(1-p_j)}, \tag{52}$$

$$\omega_j := \frac{1}{p_j}\left[1 - \nu_j^2(1-p_j)\right]. \tag{53}$$

Observe that the terms $\nu_j$ and $\omega_j$ introduced for abbreviation are very close to 1, in particular $\omega_j > 0$. Ignoring the restrictions (51) temporarily, we see that (49) tends to $+\infty$ if at least one of the coordinates $u_j$ does, therefore (49) must have an absolute minimum. We are going to locate it using a Lagrange multiplier and will then have to check conditions (51). Thus



$$0 = \frac{\partial}{\partial u_j}\left[\ln \hat F_{n-2} - \lambda \sum_{k=1}^{N} \ln u_k\right] \tag{54}$$

$$2\left[u_j - \nu_j(1-p_j)\right]u_j = \lambda\left[u_j^2 - 2\nu_j(1-p_j)\,u_j + 1 - p_j\right] \tag{55}$$

$$\ldots \tag{56}$$

We see that $\lambda/2$ will be approximately the geometric mean of the quantities $F^{(j)}_{n-1}$ and hence close to 1. Expanding $u_j = \alpha + \beta(\lambda - 2) + \gamma(\nu_j - 1) \pm \cdots$ up to linear order in the small quantities $\lambda - 2$ and $\nu_j - 1$ and substituting into (55) leads to

$$\ldots \tag{57}$$

$$\ldots \tag{58}$$

$$\ldots \tag{59}$$



Summarizing:

Theorem 2: An optimal target distribution is obtained by choosing fixed weight code words of weight $w_j = \dfrac{n \ln 2}{(1-p_j)\sum_{k=1}^{N} \frac{p_k}{1-p_k}}$ for encoding of bit $j$.

If the probabilities $p_j$ are actually independent of $j$ then this coincides with section III.
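A quick numerical check of a per-bit weight rule is possible through the product formula for the target coefficients. The sketch below rests on an assumption of ours: it uses the first-order solution of the Lagrange condition under the constraint $\prod_j u_j = \frac12$, namely $w_j \approx n \ln 2 \big/ \big[(1-p_j) \sum_k p_k/(1-p_k)\big]$, and treats the weights as real numbers; in practice they would have to be rounded to integers. The bit probabilities are synthetic.

```python
from math import log, prod

def first_order_weights(p, n):
    """Assumed first-order rule: w_j = n ln 2 / ((1 - p_j) * sum_k p_k / (1 - p_k))."""
    S = sum(pk / (1 - pk) for pk in p)
    return [n * log(2) / ((1 - pj) * S) for pj in p]

def target_zero_fraction(p, weights, n):
    """Probability that a given target bit stays 0, using F_{n-1} = 1 - w_j / n per source bit."""
    return prod(pj * (1 - wj / n) + 1 - pj for pj, wj in zip(p, weights))

# Synthetic, non uniform source bit probabilities between 1% and 20% (assumption).
N, n = 1000, 4096
p = [0.01 + 0.19 * j / (N - 1) for j in range(N)]
weights = first_order_weights(p, n)
zero_fraction = target_zero_fraction(p, weights, n)
```

With these inputs the zero fraction lands very close to one half, i.e. the target bits are nearly balanced, and the weights differ by less than 25 percent between the rarest and the most frequent source bit.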



V. The general case 

There remains the question of what to do if the source bit distributions are neither uniform nor independent. We may take a clue from theorem 2: as long as the individual bit probabilities $p_j$ are not too large, say $< 1/2$, the code word weights recommended there do not vary significantly. We may try one and the same code distribution for all source bits and thus place ourselves in the situation of theorem 1. The generating function defined in equation (39) is the same that would be obtained from an isotropic bit distribution with

$$\binom{N}{m}\, p_m = \sum_{\beta \in I^N,\ \#\beta^{-1}(1) = m} P(\beta). \tag{60}$$

Choosing fixed weight code words of weight $n(1-q)$ for a suitable parameter $q$, theorem 1 tells us that the target bits will have an expected weight of $n\left[1 - \Pi(q)\right]$, and we want to arrange for $\Pi(q) = 1 - \hat G_1 = \frac12$. Substituting $q = e^{-\varepsilon}$ with $0 < \varepsilon \ll 1$ we derive from (17):

$$\hat G_1 = 1 - \Pi(e^{-\varepsilon}) = \sum_{m=1}^{\infty} (-1)^{m+1}\, \frac{\mu_m}{m!}\, \varepsilon^m. \tag{61}$$

This equation is quite suitable for practical application, because the power series is fast converging and the lower moments of the source bit distribution are easily evaluated. We can solve for $\varepsilon$:

$$\varepsilon = \frac{1}{\mu_1}\,\hat G_1 + \frac{\mu_2}{2\mu_1^3}\,\hat G_1^2 + \frac{3\mu_2^2 - \mu_1\mu_3}{6\mu_1^5}\,\hat G_1^3 \pm \cdots \tag{62}$$

Convergence of this power series is again fast enough to use the partial sum above as a practical estimate.
Notice that in case of source words of fixed weight $r$ we have $\mu_m = r^m$ and we recover section III exactly.
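The series inversion can be tried out on the fixed weight case, where the exact answer is known. The sketch below assumes the three-term estimate $\varepsilon \approx G_1/\mu_1 + \mu_2 G_1^2/(2\mu_1^3) + (3\mu_2^2 - \mu_1\mu_3) G_1^3/(6\mu_1^5)$, obtained by reverting the alternating moment series, and compares it with $\varepsilon = \ln 2 / r$, the exact solution of $e^{-r\varepsilon} = \frac12$. Function names are ours.

```python
from math import factorial, log

def G1_series(eps, moments):
    """Partial sum of sum_{m>=1} (-1)**(m+1) * mu_m / m! * eps**m; moments[m-1] = mu_m."""
    return sum((-1) ** (m + 1) * mu / factorial(m) * eps ** m
               for m, mu in enumerate(moments, start=1))

def epsilon_estimate(G1, mu1, mu2, mu3):
    """Three-term inversion of the series, in terms of the first three source moments."""
    return (G1 / mu1
            + mu2 / (2 * mu1 ** 3) * G1 ** 2
            + (3 * mu2 ** 2 - mu1 * mu3) / (6 * mu1 ** 5) * G1 ** 3)

# Fixed weight source words of weight r: mu_m = r**m.
r = 10
eps_est = epsilon_estimate(0.5, r, r ** 2, r ** 3)
eps_exact = log(2) / r
```

For a balanced target ($G_1 = \frac12$) the three-term estimate is within a few percent of the exact value; the code word weight then follows as $n(1 - e^{-\varepsilon})$.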